Data Wrangling
Import your data
Clean your data
Explore relational data
Source: R4DS
https://data.worldbank.org/indicator/NY.GDP.MKTP.CD
Official source of global and historical GDP data
A very common control variable for IR and CP analysis
Very frustratingly messy!
Introducing here::here()
Points to where you are on your computer. Updates for everyone on any computer!
library(tidyverse)
gdp_raw <- read_csv(here::here("slides", "data-raw", "wb_gdp.csv"))
head(gdp_raw)# A tibble: 6 × 3
`Data Source` `World Development Indicators` ...3
<chr> <chr> <chr>
1 Last Updated Date 2023-03-01 <NA>
2 Country Name Country Code "Indicator Name,In…
3 Aruba ABW "GDP (current US$)…
4 Africa Eastern and Southern AFE "GDP (current US$)…
5 Afghanistan AFG "GDP (current US$)…
6 Africa Western and Central AFW "GDP (current US$)…
Read the ?read_csv help file. What arguments does this function take?
Head to the readr package documentation and find what other file types you can read in.
Source: World Bank
gdp_raw <- read_csv(here::here("slides", "data-raw", "wb_gdp.csv"), skip = 4, col_select = 1:66)
head(gdp_raw)# A tibble: 6 × 66
Country…¹ Count…² Indic…³ Indic…⁴ `1960` `1961` `1962` `1963` `1964`
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aruba ABW GDP (c… NY.GDP… NA NA NA NA NA
2 Africa E… AFE GDP (c… NY.GDP… 2.13e10 2.18e10 2.37e10 2.82e10 2.61e10
3 Afghanis… AFG GDP (c… NY.GDP… 5.38e 8 5.49e 8 5.47e 8 7.51e 8 8.00e 8
4 Africa W… AFW GDP (c… NY.GDP… 1.04e10 1.11e10 1.19e10 1.27e10 1.38e10
5 Angola AGO GDP (c… NY.GDP… NA NA NA NA NA
6 Albania ALB GDP (c… NY.GDP… NA NA NA NA NA
# … with 57 more variables: `1965` <dbl>, `1966` <dbl>, `1967` <dbl>,
# `1968` <dbl>, `1969` <dbl>, `1970` <dbl>, `1971` <dbl>, `1972` <dbl>,
# `1973` <dbl>, `1974` <dbl>, `1975` <dbl>, `1976` <dbl>, `1977` <dbl>,
# `1978` <dbl>, `1979` <dbl>, `1980` <dbl>, `1981` <dbl>, `1982` <dbl>,
# `1983` <dbl>, `1984` <dbl>, `1985` <dbl>, `1986` <dbl>, `1987` <dbl>,
# `1988` <dbl>, `1989` <dbl>, `1990` <dbl>, `1991` <dbl>, `1992` <dbl>,
# `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, `1996` <dbl>, `1997` <dbl>, …
Taking a look at this data set:
What types of data do we have? Are they the right type of data?
Are we missing data points?
Source: R4DS
What do you want to do with your data?
What I need:
Annual data on each country’s GDP
The region to which each country belongs
To do:
Move the yearly data from columns to rows
Clean up these column names so that they are easier to use in R
Add regional data
[1] "Country Name" "Country Code" "Indicator Name" "Indicator Code"
[5] "1960" "1961" "1962" "1963"
[9] "1964" "1965" "1966" "1967"
[13] "1968" "1969" "1970" "1971"
[17] "1972" "1973" "1974" "1975"
[21] "1976" "1977" "1978" "1979"
[25] "1980" "1981" "1982" "1983"
[29] "1984" "1985" "1986" "1987"
[33] "1988" "1989" "1990" "1991"
[37] "1992" "1993" "1994" "1995"
[41] "1996" "1997" "1998" "1999"
[45] "2000" "2001" "2002" "2003"
[49] "2004" "2005" "2006" "2007"
[53] "2008" "2009" "2010" "2011"
[57] "2012" "2013" "2014" "2015"
[61] "2016" "2017" "2018" "2019"
[65] "2020" "2021"
Column names should not:
Have spaces
Start with numbers
Introducing janitor:
Introducing the countrycode package - the indispensable workhorse of country data:
library(countrycode)
# Add each country's World Bank region to the data set
gdp_df <- gdp_df |>
mutate(region = countrycode(country_name, "country.name", "region", custom_match = c("Turkiye" = "Europe & Central Asia"))) |>
# Remove observations that are regions, eg. "East Asia & Pacific" or "World"
drop_na(region) |>
relocate(region, .after = "country_code")# A tibble: 10 × 2
country_name region
<chr> <chr>
1 Aruba Latin America & Caribbean
2 Afghanistan South Asia
3 Angola Sub-Saharan Africa
4 Albania Europe & Central Asia
5 Andorra Europe & Central Asia
6 United Arab Emirates Middle East & North Africa
7 Argentina Latin America & Caribbean
8 Armenia Europe & Central Asia
9 American Samoa East Asia & Pacific
10 Antigua and Barbuda Latin America & Caribbean
Rows: 13,454
Columns: 7
$ country_name <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "…
$ country_code <chr> "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "ABW",…
$ region <chr> "Latin America & Caribbean", "Latin America & Caribbean…
$ indicator_name <chr> "GDP (current US$)", "GDP (current US$)", "GDP (current…
$ indicator_code <chr> "NY.GDP.MKTP.CD", "NY.GDP.MKTP.CD", "NY.GDP.MKTP.CD", "…
$ year <chr> "1960", "1961", "1962", "1963", "1964", "1965", "1966",…
$ gdp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
Rows: 13,454
Columns: 4
$ country_name <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Ar…
$ region <chr> "Latin America & Caribbean", "Latin America & Caribbean",…
$ year <int> 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 196…
$ gdp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
By default, R will carry forward NAs. This is good!
Today you:
Learnt how to read in external data
Learnt how to clean common data problems using R